While pointwise operations treat each element in a tensor independently, reduction patterns introduce data dependencies where multiple input elements are collapsed into a single output value (e.g., sum, max, or mean). To implement these efficiently, one must bridge the gap between the logical 2D structure of data and its linear representation in hardware memory.
1. 2D Memory Mapping
A 2D tensor is a logical grid, but it is stored as a linear sequence in RAM. Understanding row-major vs. column-major layout is essential for determining whether a reduction traverses contiguous memory addresses or requires strided access.
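As a sketch of this mapping (using NumPy, which defaults to row-major/C order), element $(i, j)$ of an $R \times C$ matrix lives at flat offset $i \cdot C + j$, and the array's strides make the layout explicit:

```python
import numpy as np

# Row-major (C-order) layout: element (i, j) of an R x C matrix
# lives at flat offset i * C + j.
R, C = 3, 4
a = np.arange(R * C).reshape(R, C)  # NumPy defaults to row-major

i, j = 1, 2
flat = a.ravel(order="C")
assert flat[i * C + j] == a[i, j]

# Strides (in bytes) confirm the layout: stepping one column moves
# one element; stepping one row skips a full row of C elements.
assert a.strides == (C * a.itemsize, a.itemsize)
```

In column-major (Fortran) order the roles are swapped: column neighbors are contiguous and row neighbors are strided.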
2. Pointwise vs. Reduction Topology
A matrix copy represents a pointwise operation with a $1:1$ input-to-output mapping. In contrast, a reduction is a many-to-one ($N:1$) operation that necessitates shared accumulation across threads or sequential processing within a block.
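The topological difference can be sketched in NumPy: a copy maps each input element to exactly one output element, while a sum collapses all $N$ elements into one accumulator (here shown as a simple sequential loop, the single-threaded analogue of shared accumulation):

```python
import numpy as np

x = np.arange(6, dtype=np.float64).reshape(2, 3)

# Pointwise (1:1): each output element depends on exactly one input;
# the output has the same shape as the input.
y = x.copy()
assert y.shape == x.shape

# Reduction (N:1): all N = x.size inputs collapse into one output
# via sequential accumulation over the flattened data.
acc = 0.0
for v in x.ravel():
    acc += v
assert acc == x.sum()  # matches the library reduction
```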
3. Dimensionality Collapse
Reductions are defined by the axis of operation. Reducing along axis 1 (collapsing each row to a single value) versus axis 0 (collapsing each column) fundamentally changes the memory stride pattern and, consequently, hardware cache hit rates.
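The stride contrast can be illustrated with NumPy on a row-major array: an axis-1 reduction accumulates contiguous elements, while an axis-0 reduction walks memory with a stride of one full row:

```python
import numpy as np

a = np.arange(12, dtype=np.float64).reshape(3, 4)  # row-major layout

# axis=1: each row collapses to one value; the summed elements sit at
# consecutive memory addresses (unit stride, cache-friendly).
row_sums = a.sum(axis=1)  # shape (3,)

# axis=0: each column collapses to one value; consecutive elements of
# a column are 4 elements apart in memory (strided access).
col_sums = a.sum(axis=0)  # shape (4,)

assert row_sums.shape == (3,) and col_sums.shape == (4,)
assert row_sums[0] == 0 + 1 + 2 + 3  # first row: contiguous walk
assert col_sums[0] == 0 + 4 + 8      # first column: strided walk
```

Libraries often sidestep the slow case by reordering the loop so the innermost iteration stays contiguous regardless of the reduction axis.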